Skip to content

Conversation

@cfbolz
Copy link

@cfbolz cfbolz commented Nov 13, 2019

When unpacking msgpack objects, store the keys that appear into a memo
dictionary to make them unique. This is useful, because for most sizable
msgpack files the same keys appear again and again, since many objects
have the same "shape" (set of keys). A similar optimization is done in
most json deserializers, eg in CPython:
https://github.com/python/cpython/blob/d89cea15ad37e873003fc74ec2c77660ab620b00/Modules/_json.c#L717

My totally unscientific results: I tried this on two big msgpack files,
a wikidata dump (92 MiB) and a dump of reddit comments (596 MiB). I am
reporting time spent deserializing and memory use of the resulting data
structure. I've included json deserialization numbers as a comparison.
The results I get on my old-ish laptop are:

wikidata
                      time   memory
CPython 3.7.5 before  3.42s  1279 MiB
CPython 3.7.5 after   3.43s   883 MiB
PyPy3 7.2 before      6.44s  1380 MiB
PyPy3 7.2 after       4.98s   965 MiB

CPython 3.7.5 json    4.13s   887 MiB
PyPy3 7.2 json        3.54s   958 MiB

reddit
CPython 3.7.5 before   5.62s  3412 MiB
CPython 3.7.5 after    5.20s  1754 MiB
PyPy3 7.2 before      14.72s  3782 MiB
PyPy3 7.2 after        8.37s  2086 MiB

CPython 3.7.5 json     8.64s  1753 MiB
PyPy3 7.2 json        10.52s  2052 MiB

For wikidata, there is only a memory improvement on CPython, the time
stays the same. For all other three variants (all of reddit, pypy on
wikidata) both time and memory improve significantly. The reason for the
memory improvements are due to the memoizing, time improves due to
better cache locality due to the smaller working set, and less time
spent in GC in the case of PyPy.

When unpacking msgpack objects, store the keys that appear into a memo
dictionary to make them unique. This is useful, because for most sizable
msgpack files the same keys appear again and again, since many objects
have the same "shape" (set of keys). A similar optimization is done in
most json deserializers, eg in CPython:
https://github.com/python/cpython/blob/d89cea15ad37e873003fc74ec2c77660ab620b00/Modules/_json.c#L717

My totally unscientific results: I tried this on two big msgpack files,
a wikidata dump (92 MiB) and a dump of reddit comments (596 MiB). I am
reporting time spent deserializing and memory use of the resulting data
structure. I've included json deserialization numbers as a comparison.
The results I get on my old-ish laptop are:

wikidata
                      time   memory
CPython 3.7.5 before  3.42s  1279 MiB
CPython 3.7.5 after   3.43s   883 MiB
PyPy3 7.2 before      6.44s  1380 MiB
PyPy3 7.2 after       4.98s   965 MiB

CPython 3.7.5 json    4.13s   887 MiB
PyPy3 7.2 json        3.54s   958 MiB

reddit
CPython 3.7.5 before   5.62s  3412 MiB
CPython 3.7.5 after    5.20s  1754 MiB
PyPy3 7.2 before      14.72s  3782 MiB
PyPy3 7.2 after        8.37s  2086 MiB

CPython 3.7.5 json     8.64s  1753 MiB
PyPy3 7.2 json        10.52s  2052 MiB

For wikidata, there is only a memory improvement on CPython, the time
stays the same. For all other three variants (all of reddit, pypy on
wikidata) both time and memory improve significantly. The reason for the
memory improvements are due to the memoizing, time improves due to
better cache locality due to the smaller working set, and less time
spent in GC in the case of PyPy.
PyErr_Format(PyExc_ValueError, "%.100s is not allowed for map key", Py_TYPE(k)->tp_name);
return -1;
}
if (PyUnicode_CheckExact(k) || PyBytes_CheckExact(k)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not make a big difference, but this can be combined with the previous condition in l.194 :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I fixed this.

ctx.user.encoding = encoding
ctx.user.unicode_errors = unicode_errors
Py_INCREF(d)
ctx.user.memo = <PyObject*>d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK <object>d will enable refcounting in Cython. Otherwise object and PyObject* are identical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I didn't manage to follow this suggestion, because the memo field is in a C struct, which doesn't support object fields.

@methane
Copy link
Member

methane commented Nov 13, 2019

I prefer interning because it speed up string comparison too.

@jfolz
Copy link
Contributor

jfolz commented Nov 13, 2019

Isn't interning a bit dangerous, i.e., those strings are there forever until the interpreter exits?

@methane
Copy link
Member

methane commented Nov 13, 2019

PyUnicode_InternImmortal() creates immortal string. It is bit dangerous as you said.
But PyUnicode_InternInplace() doesn't create immortal string.

@methane
Copy link
Member

methane commented Nov 13, 2019

BTW, No need to care about Python 2 and bytes object.
Optimize only unicode in Python 3.

@jfolz
Copy link
Contributor

jfolz commented Nov 13, 2019

Interesting:

Changed in version 2.3: Interned strings are not immortal

So it's been like that for a very long time and that story has just been retold again and again since. PyUnicode_InternImmortal seems to be undocumented - to keep people from using it since it's entirely unnecessary I presume. So interning is exactly as safe and more effective than a memo dict.

@cfbolz
Copy link
Author

cfbolz commented Nov 13, 2019

OK, happy to switch to interning in the C version. I would still stick with an explicit dict in the fallback version if that's OK?

@jfolz
Copy link
Contributor

jfolz commented Nov 13, 2019

Ideally fallback would behave identical to native code. Since you're a PyPy developer I suppose it doesn't play nicely there?

@methane
Copy link
Member

methane commented Nov 14, 2019

I would still stick with an explicit dict in the fallback version if that's OK?

What's wrong about sys.intern()?

@cfbolz
Copy link
Author

cfbolz commented Nov 15, 2019

Ok, I investigated, seems PyPy is also good about intern not leaking anything, I wasn't sure :-).

@methane methane closed this Dec 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants